Attribute identification device, attribute identification method, and program

ABSTRACT

An attribute identification technology that can reject an attribute identification result if the reliability thereof is low is provided. An attribute identification device includes: a posteriori probability calculation unit 110 that calculates, from input speech, a posteriori probability sequence {q(c, i)} which is a sequence of the posteriori probabilities q(c, i) that a frame i of the input speech is a class c; a reliability calculation unit 120 that calculates, from the posteriori probability sequence {q(c, i)}, reliability r(c) indicating the extent to which the class c is a correct attribute identification result; and an attribute identification result generating unit 130 that generates an attribute identification result L of the input speech from the posteriori probability sequence {q(c, i)} and the reliability r(c). The attribute identification result generating unit 130 obtains a most probable estimated class ĉ, which is a class that is estimated to be the most probable attribute, from the posteriori probability sequence {q(c, i)} and sets ϕ indicating rejection as the attribute identification result L if the reliability r(ĉ) of the most probable estimated class ĉ falls within a predetermined range indicating that the reliability r(ĉ) is low and sets the most probable estimated class ĉ as the attribute identification result L otherwise.

TECHNICAL FIELD

The present invention relates to a technology for identifying an attribute of a speaker based on uttered speech.

BACKGROUND ART

A technology for identifying an attribute (for example, gender or an age bracket) based on speech is needed for the purpose of gathering marketing information by a voice interactive robot or in a call center. Existing technologies for attribute identification include a method for identifying an attribute by using a Gaussian mixture model (GMM) (Non-patent Literature 1), a method for identifying an attribute by i-vectors extracted from speech by using a support vector machine (SVM), and the like.

With these existing technologies, an attribute is sometimes erroneously identified due to the influence of ambient noise. In particular, when a radio broadcast, a television broadcast, or the like including speech or music is superimposed on uttered speech as noise (hereinafter also referred to as television noise), a plurality of types of speech are present. In this case, it is difficult to differentiate between the uttered speech and the speech included in the television noise, which results in erroneous identification of an attribute.

For this reason, a method for implementing robust attribute identification by performing machine learning on speech on which noise is superimposed in advance has also been proposed (Non-patent Literature 2).

PRIOR ART LITERATURE

Non-Patent Literature

Non-patent Literature 1: Shoko Miyamori, Ryuichi Nishimura, Risa Kurihara, Toshio Irino, Hideki Kawahara, “An investigation of child user identification based on speech recognition of a short sentence”, FIT (The Institute of Electronics, Information and Communication Engineers and Information Processing Society of Japan) Steering Committee, Proceedings of Forum on Information Technology 9(3), pp. 469-472, 2010.

Non-patent Literature 2: Satoshi Nakamura, “Towards Robust Speech Recognition in Real Acoustic Environments”, The Institute of Electronics, Information and Communication Engineers, Technical report of IEICE, EA2002-12, SP2002-12, pp. 31-36, 2002.

SUMMARY OF THE INVENTION

Problems to be Solved by the Invention

However, since the variety of noise conditions that occur due to the influence of speech or music included in television noise is quite wide, it is impossible to perform exhaustive learning so as to achieve a robust operation under any noise conditions. Moreover, if learning is performed by using learning data limited to some noise conditions, speech included in noise is learned as a feature for attribute identification, which can actually cause an error in identification which is performed under low-noise conditions. Therefore, in view of the degree of satisfaction of the user (hereinafter referred to as usability), it is better to reject an erroneous identification result than to provide the result; however, outputting an attribute identification result uniformly, regardless of whether the result is likely to be erroneous, impairs usability.

An object of the present invention is accordingly to provide an attribute identification technology that can reject an attribute identification result if the reliability of the attribute identification result is low.

Means to Solve the Problems

An aspect of the present invention is an attribute identification device, in which, if I is assumed to be an integer greater than or equal to 0 and a set of classes for identifying a speaker of uttered speech is assumed to be an attribute, the attribute identification device includes: a posteriori probability calculation unit that calculates, from input speech s(t), a posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) which is a sequence of the posteriori probabilities q(c, i) that a frame i of the input speech s(t) is a class c; a reliability calculation unit that calculates, from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I), reliability r(c) indicating the extent to which the class c is a correct attribute identification result; and an attribute identification result generating unit that generates an attribute identification result L of the input speech s(t) from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) and the reliability r(c). The attribute identification result generating unit obtains a most probable estimated class ĉ, which is a class that is estimated to be the most probable attribute, from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) and sets ϕ indicating rejection as the attribute identification result L if the reliability r(ĉ) of the most probable estimated class ĉ falls within a predetermined range indicating that the reliability r(ĉ) is low and sets the most probable estimated class ĉ as the attribute identification result L otherwise.

Another aspect of the present invention is an attribute identification device, in which, if I is assumed to be an integer greater than or equal to 0 and a set of classes for identifying a speaker of uttered speech is assumed to be an attribute, the attribute identification device includes: a posteriori probability calculation unit that calculates, from input speech s(t), a posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) which is a sequence of the posteriori probabilities q(c, i) that a frame i of the input speech s(t) is a class c; a reliability calculation unit that calculates, from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I), reliability r(c) indicating the extent to which the class c is a correct attribute identification result; and an attribute identification result generating unit that generates an attribute identification result L of the input speech s(t) from the reliability r(c). The attribute identification result generating unit obtains a most probable estimated class ĉ, which is a class that is estimated to be the most probable attribute, from the reliability r(c) and sets ϕ indicating rejection as the attribute identification result L if the reliability r(ĉ) of the most probable estimated class ĉ falls within a predetermined range indicating that the reliability r(ĉ) is low and sets the most probable estimated class ĉ as the attribute identification result L otherwise.

Another aspect of the present invention is an attribute identification device, in which, if I is assumed to be an integer greater than or equal to 0 and a set of classes for identifying a speaker of uttered speech is assumed to be an attribute, the attribute identification device includes: a posteriori probability calculation unit that calculates, from input speech s(t), a posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) which is a sequence of the posteriori probabilities q(c, i) that a frame i of the input speech s(t) is a class c; and an attribute identification result generating unit that generates an attribute identification result L of the input speech s(t) from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I). The attribute identification result generating unit includes a reliability calculation unit that calculates, from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I), reliability r(c) indicating the extent to which the class c is a correct attribute identification result, and obtains a most probable estimated class ĉ, which is a class that is estimated to be the most probable attribute, from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I), calculates the reliability r(ĉ) of the most probable estimated class ĉ by using the reliability calculation unit, and sets ϕ indicating rejection as the attribute identification result L if the reliability r(ĉ) falls within a predetermined range indicating that the reliability r(ĉ) is low and sets the most probable estimated class ĉ as the attribute identification result L otherwise.

Effects of the Invention

According to the present invention, it is possible to prevent impairment of usability by rejecting an attribute identification result if the reliability thereof, which indicates the certainty of the attribute identification result, is low.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating an example of the configuration of an attribute identification device 100.

FIG. 2 is a flowchart showing an example of the operation of the attribute identification device 100.

FIG. 3A is a diagram showing an example of time variations in posteriori probability and reliability.

FIG. 3B is a diagram showing an example of time variations in posteriori probability and reliability.

FIG. 4 is a block diagram illustrating an example of the configuration of an attribute identification device 101.

FIG. 5 is a flowchart showing an example of the operation of the attribute identification device 101.

FIG. 6 is a block diagram illustrating an example of the configuration of an attribute identification device 102.

FIG. 7 is a flowchart showing an example of the operation of the attribute identification device 102.

FIG. 8 is a block diagram illustrating an example of the configuration of a reliability calculation model learning device 200.

FIG. 9 is a flowchart showing an example of the operation of the reliability calculation model learning device 200.

FIG. 10A is a diagram showing an example of time variations in posteriori probability.

FIG. 10B is a diagram showing an example of time variations in posteriori probability.

FIG. 10C is a diagram showing an example of time variations in posteriori probability.

DETAILED DESCRIPTION OF THE EMBODIMENTS

Hereinafter, embodiments of the present invention will be described in detail. It is to be noted that component units having the same function will be identified with the same reference numeral and overlapping explanations will be omitted.

Definition

Hereinafter, terms which are used in the embodiments will be described.

Speech s(t) is an amplitude at a sampling time t (t=0, 1, . . . , T_(k)−1, where T_(k) is an integer greater than or equal to 1) when the sampling frequency is assumed to be f_(s) [Hz]. Moreover, a feature amount x(i) is a feature amount which is extracted from a frame i (i=0, 1, . . . , I, where I is an integer greater than or equal to 0 and I+1 represents the number of frames generated from the speech s(t)) of the speech s(t). For example, when the mel-frequency cepstral coefficient (MFCC) or fundamental frequency is used as a feature amount, the feature amount can be extracted by setting appropriate analysis frame width and frame shift (for instance, by setting the analysis frame width to 50 ms and the frame shift to 25 ms).
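By way of illustration only, the framing step that precedes feature extraction might be sketched in Python as follows. The function name `split_into_frames`, the 16 kHz sampling frequency in the usage note, and the choice to drop a trailing partial frame are assumptions made for this sketch; the computation of the MFCC or fundamental frequency from each frame is not shown.

```python
def split_into_frames(s, fs, width_ms=50.0, shift_ms=25.0):
    """Split samples s (sampled at fs Hz) into overlapping analysis frames."""
    width = int(round(fs * width_ms / 1000.0))  # samples per analysis frame
    shift = int(round(fs * shift_ms / 1000.0))  # samples between frame starts
    if width < 1 or shift < 1:
        raise ValueError("frame width and shift must be at least one sample")
    frames = []
    start = 0
    # Assumption: a trailing partial frame is dropped rather than zero-padded.
    while start + width <= len(s):
        frames.append(s[start:start + width])
        start += shift
    return frames
```

For instance, calling `split_into_frames(s, 16000)` on one second of speech yields frames of 800 samples whose start positions are 400 samples apart; a feature amount x(i) would then be extracted from each frame i.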

An attribute is a set of classes (attribute values) for identifying a speaker of uttered speech. For example, for an attribute “gender”, “male” and “female” are provided as classes. For an attribute “age bracket”, “teens”, “twenties”, “thirties”, and the like are provided as classes. Moreover, an attribute obtained by combining gender and an age bracket may be used; in this case, for example, “adult male”, “adult female”, “child”, and the like can be used as classes. In general, a class (an attribute value) is expressed as c (c=0, 1, . . . , C, where C is an integer greater than or equal to 0 and C+1 represents the number of classes). For instance, an attribute value c in gender identification only has to represent “male” when c=0 and “female” when c=1. Furthermore, an attribute identification model λ_(c) is a model that outputs, by using the feature amount x(i) of a frame i as input, the posteriori probability p(c|x(i)) (c=0, 1, . . . , C) that a class is c when the feature amount is x(i). The attribute identification model λ_(c) can be implemented by using, for example, a neural network such as a deep neural network (DNN).

First Embodiment

Hereinafter, an attribute identification device 100 will be described with reference to FIGS. 1 and 2. FIG. 1 is a block diagram illustrating the configuration of the attribute identification device 100. FIG. 2 is a flowchart showing the operation of the attribute identification device 100. As illustrated in FIG. 1, the attribute identification device 100 includes a posteriori probability calculation unit 110, a reliability calculation unit 120, an attribute identification result generating unit 130, and a recording unit 190. The recording unit 190 is a component unit on which information necessary for processing which is performed by the attribute identification device 100 is appropriately recorded. For example, a threshold δ which is used by the attribute identification result generating unit 130 is recorded on the recording unit 190 in advance.

Moreover, the attribute identification device 100 reads data of an attribute identification model 930 as appropriate and executes processing. FIG. 1 shows a configuration in which the attribute identification model 930 is recorded on an external recording unit; alternatively, the attribute identification model 930 may be recorded on the recording unit 190 included in the attribute identification device 100. Hereinafter, in the present embodiment, the attribute identification model 930 recorded on an external recording unit and the attribute identification model 930 recorded on the recording unit 190 are not differentiated from one another and are expressed as the attribute identification model λ_(c).

The attribute identification device 100 generates, from input speech s(t), an attribute identification result L which is the result of identification of an attribute of a speaker of the input speech s(t) and outputs the attribute identification result L.

The operation of the attribute identification device 100 will be described in accordance with FIG. 2. The posteriori probability calculation unit 110 calculates, from the input speech s(t), a posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) which is a sequence of the posteriori probabilities q(c, i) that a frame i of the input speech s(t) is a class c (S110). Specifically, by using the attribute identification model λ_(c), the posteriori probability calculation unit 110 obtains the posteriori probability p(c|x(i)) that a feature amount x(i) extracted from a frame i of the input speech s(t) is a class c and sets q(c, i)=p(c|x(i)). Here, 0≤q(c, i)≤1 (c=0, 1, . . . , C; i=0, 1, . . . , I) and Σ_(c)q(c, i)=1 (i=0, 1, . . . , I) hold.

The reliability calculation unit 120 calculates the reliability r(c) of the class c from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) of the class c (S120). Here, the reliability r(c) of the class c is a value indicating the extent to which the class c is a correct attribute identification result, and the reliability r(c) is defined as an index that satisfies 0≤r(c)≤1 and indicates that the closer the reliability r(c) is to 1, the more certain the attribute identification result is. For example, the reliability r(c) may be defined, as expressed in the following formula, as the average of posteriori probabilities for each class.

$\begin{matrix}{{r(c)} = {\frac{1}{I + 1}{\sum\limits_{i = 0}^{I}{q\left( {c,\ i} \right)}}}} & (1)\end{matrix}$

Moreover, the reliability r(c) may be defined, as expressed in the following formula, by using the product of posteriori probabilities for each class.

$\begin{matrix}{{{r(c)} = \frac{\; {\overset{I}{\prod\limits_{i = 0}}{q\left( {c,i} \right)}}}{\sum\limits_{c^{\prime} = 0}^{C}{\prod\limits_{i = 0}^{I}{q\left( {c^{\prime},i} \right)}}}}\mspace{14mu}} & (2)\end{matrix}$

When the reliability r(c) is defined by using Formula (2), the value of r(c) is close to 1 (for example, 0.9999) for almost every input speech, which sometimes requires fine tuning of the threshold δ that the attribute identification result generating unit 130 uses to determine whether or not to reject a most probable estimated class ĉ. For this reason, the reliability r(c) may be defined, as expressed in the following formula, by using a formula x^(v), which gradually changes between 0 and 1, using an appropriate parameter v (0<v<1).

$\begin{matrix}{{r(c)} = \frac{\left\{ {\prod\limits_{i = 0}^{I}{q\left( {c,i} \right)}} \right\}^{v}}{\sum\limits_{c^{\prime} = 0}^{C}\left\{ {\prod\limits_{i = 0}^{I}{q\left( {c^{\prime},i} \right)}} \right\}^{v}}} & (3)\end{matrix}$

The attribute identification result generating unit 130 generates an attribute identification result L of the input speech s(t) from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) of the class c and the reliability r(c) of the class c (S130). Specifically, the attribute identification result generating unit 130 first obtains, from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I), a most probable estimated class ĉ by the following formula. As is clear from the formula, the most probable estimated class is a class which is estimated to be the most probable attribute.

$\begin{matrix}{\hat{c} = {\underset{c}{\arg\max}\left( {\sum\limits_{i = 0}^{I}{\log q\left( {c,i} \right)}} \right)}} & (4)\end{matrix}$

Next, the attribute identification result generating unit 130 compares the reliability r(ĉ) with the threshold δ (0<δ<1). If r(ĉ)≥δ (or r(ĉ)>δ), the attribute identification result generating unit 130 sets the most probable estimated class ĉ as the attribute identification result L; if r(ĉ)<δ (or r(ĉ)≤δ), the attribute identification result generating unit 130 rejects the most probable estimated class ĉ and sets ϕ, which indicates rejection, as the attribute identification result L.

It is to be noted that a case in which r(ĉ)<δ or r(ĉ)≤δ is referred to as a case in which the reliability r(ĉ) falls within a predetermined range indicating that the reliability r(ĉ) is low.
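The selection by Formula (4) followed by the threshold comparison might be sketched in Python as follows. This is an illustration under stated assumptions: `REJECT` (here `None`) stands in for ϕ, the function names and the floor `eps` are hypothetical, the reliabilities `r` are assumed to have been calculated beforehand, and the r(ĉ)≥δ variant of the comparison is used.

```python
import math

REJECT = None  # stands in for the rejection symbol ϕ (an assumption of this sketch)

def most_probable_class(q):
    """Formula (4): argmax over c of the sum over frames of log q(c, i)."""
    eps = 1e-300  # assumed floor so that log(0) is never evaluated
    scores = [sum(math.log(max(p, eps)) for p in row) for row in q]
    return max(range(len(scores)), key=scores.__getitem__)

def identify(q, r, delta):
    """Return the most probable estimated class, or REJECT if r of that class < delta."""
    c_hat = most_probable_class(q)
    return c_hat if r[c_hat] >= delta else REJECT
```

A caller would treat a return value of `REJECT` as "no attribute identification result is presented", which is the behavior intended for low-reliability input.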

FIGS. 3A and 3B show time variations in posteriori probability and time variations in reliability which is defined by Formula (3) on the assumption that v=1/32. FIG. 3A shows variations in posteriori probability and reliability in the presence of input speech alone, and FIG. 3B shows variations in posteriori probability and reliability in the presence of input speech on which television noise is superimposed. It is clear that, in the presence of input speech alone, a class whose reliability eventually has a value close to 1 appears when the input speech gets to a certain length, whereas, in the presence of input speech on which television noise is superimposed, each class tends to have a value lower than the corresponding value in the presence of input speech alone and no class has a value close to 1. Since reliability has such a feature, if the reliability of a most probable estimated class does not reach the predetermined threshold δ, it is possible to reject the most probable estimated class as a low-reliability class which may be erroneously identified.

(First Modification)

The attribute identification device 100 is configured so as to use the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) as input to the attribute identification result generating unit 130; alternatively, the attribute identification device 100 may be configured so as to generate the attribute identification result L without using the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I). Hereinafter, an attribute identification device 101 will be described with reference to FIGS. 4 and 5. FIG. 4 is a block diagram illustrating the configuration of the attribute identification device 101. FIG. 5 is a flowchart showing the operation of the attribute identification device 101. As illustrated in FIG. 4, the attribute identification device 101 includes the posteriori probability calculation unit 110, the reliability calculation unit 120, an attribute identification result generating unit 131, and the recording unit 190.

The operation of the attribute identification device 101 will be described in accordance with FIG. 5. The posteriori probability calculation unit 110 calculates, from input speech s(t), a posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) which is a sequence of the posteriori probabilities q(c, i) that a frame i of the input speech s(t) is a class c (S110). The reliability calculation unit 120 calculates the reliability r(c) of the class c from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) of the class c (S120).

The attribute identification result generating unit 131 generates an attribute identification result L of the input speech s(t) from the reliability r(c) of the class c (S131). Specifically, the attribute identification result generating unit 131 first obtains a most probable estimated class ĉ from the reliability r(c) of the class c by the following formula.

$\begin{matrix}{\hat{c} = {\underset{c}{\arg\max}\left( {r(c)} \right)}} & (5)\end{matrix}$

Next, the attribute identification result generating unit 131 compares the reliability r(ĉ) with the threshold δ (0<δ<1). If r(ĉ)≥δ (or r(ĉ)>δ), the attribute identification result generating unit 131 sets the most probable estimated class ĉ as the attribute identification result L; if r(ĉ)<δ (or r(ĉ)≤δ), the attribute identification result generating unit 131 rejects the most probable estimated class ĉ and sets ϕ, which indicates rejection, as the attribute identification result L.
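In this modification the decision depends on the reliability alone, which might be sketched in Python as follows. The function name `identify_from_reliability` and the use of `None` for ϕ are assumptions of this sketch, and the r(ĉ)≥δ variant of the comparison is used.

```python
def identify_from_reliability(r, delta, reject=None):
    """Pick the class with the highest reliability (Formula (5)), then apply the threshold.

    r[c] is the reliability of class c; reject stands in for ϕ.
    """
    c_hat = max(range(len(r)), key=r.__getitem__)
    return c_hat if r[c_hat] >= delta else reject
```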

(Second Modification)

Moreover, the attribute identification device 100 is configured so as to use the reliability r(c) as input to the attribute identification result generating unit 130; alternatively, the attribute identification device 100 may be configured so as to use only the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) as input. In this case, calculation of reliability is performed only on a most probable estimated class. Hereinafter, an attribute identification device 102 will be described with reference to FIGS. 6 and 7. FIG. 6 is a block diagram illustrating the configuration of the attribute identification device 102. FIG. 7 is a flowchart showing the operation of the attribute identification device 102. As illustrated in FIG. 6, the attribute identification device 102 includes the posteriori probability calculation unit 110, an attribute identification result generating unit 132, and the recording unit 190.

The operation of the attribute identification device 102 will be described in accordance with FIG. 7. The posteriori probability calculation unit 110 calculates, from input speech s(t), a posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) which is a sequence of the posteriori probabilities q(c, i) that a frame i of the input speech s(t) is a class c (S110).

The attribute identification result generating unit 132 generates an attribute identification result L of the input speech s(t) from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) of the class c (S132). Specifically, the attribute identification result generating unit 132 first obtains a most probable estimated class ĉ from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) by Formula (4). Next, the attribute identification result generating unit 132 calculates the reliability r(ĉ) of the most probable estimated class ĉ. For this calculation, Formulae (1) to (3) can be used; for example, the attribute identification result generating unit 132 only has to be configured so as to include the reliability calculation unit 120. Finally, the attribute identification result generating unit 132 compares the reliability r(ĉ) with the threshold δ (0<δ<1). If r(ĉ)≥δ (or r(ĉ)>δ), the attribute identification result generating unit 132 sets the most probable estimated class ĉ as the attribute identification result L; if r(ĉ)<δ (or r(ĉ)≤δ), the attribute identification result generating unit 132 rejects the most probable estimated class ĉ and sets ϕ, which indicates rejection, as the attribute identification result L.
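A sketch of this flow in Python, in which the reliability is calculated only for the most probable estimated class, could look as follows. Formula (1) is chosen here purely for illustration because it needs only the sequence of the selected class; the function name `identify_lazy`, the floor `eps`, and the use of `None` for ϕ are assumptions.

```python
import math

def identify_lazy(q, delta, reject=None):
    """Select the class by Formula (4), then compute its reliability only.

    q[c][i] is the posteriori probability of class c at frame i.
    """
    eps = 1e-300  # assumed floor so that log(0) is never evaluated
    scores = [sum(math.log(max(p, eps)) for p in row) for row in q]
    c_hat = max(range(len(q)), key=scores.__getitem__)
    # Formula (1) evaluated for the selected class alone.
    r_hat = sum(q[c_hat]) / len(q[c_hat])
    return c_hat if r_hat >= delta else reject
```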

According to the present invention, by rejecting an attribute identification result if the reliability thereof, which indicates the certainty of the attribute identification result, is low, it is possible to prevent impairment of usability and prevent displeasure caused by presenting an unreliable identification result to the user.

Second Embodiment

In the first embodiment, reliability is calculated by using formulae such as Formulae (1) to (3). In a second embodiment, reliability is calculated by using a reliability calculation model instead of using formulae. That is, an attribute identification device of the second embodiment differs from each attribute identification device of the first embodiment only in that the reliability r(c) of a class c is calculated from a posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) of the class c by using the reliability calculation model. This reliability calculation model is learned by a reliability calculation model learning device 200. The reliability calculation model is recorded on a recording unit of the attribute identification device before the attribute identification device starts processing.

Hereinafter, the reliability calculation model learning device 200 will be described with reference to FIGS. 8 and 9. FIG. 8 is a block diagram illustrating the configuration of the reliability calculation model learning device 200. FIG. 9 is a flowchart showing the operation of the reliability calculation model learning device 200. As illustrated in FIG. 8, the reliability calculation model learning device 200 includes a speech-with-noise-superimposed-thereon generating unit 210, a posteriori probability calculation unit 110, an attribute identification result generating unit 230, a reliability label generating unit 240, a reliability calculation model learning unit 250, and a recording unit 290. The recording unit 290 is a component unit on which information necessary for processing which is performed by the reliability calculation model learning device 200 is appropriately recorded. For example, a database on a posteriori probability sequence with a reliability label, which is used by the reliability calculation model learning unit 250 for learning, is recorded on the recording unit 290.

Moreover, the reliability calculation model learning device 200 reads data of each of a speech database 910, a noise database 920, and an attribute identification model 930 as appropriate and executes processing. FIG. 8 shows a configuration in which each of the speech database 910, the noise database 920, and the attribute identification model 930 is recorded on an external recording unit; alternatively, each of the speech database 910, the noise database 920, and the attribute identification model 930 may be recorded on the recording unit 290 included in the reliability calculation model learning device 200.

The speech database 910 is a database made up of M+1 items of speech with an attribute label, each being a tuple of speech s_(m)(t) (m=0, 1, . . . , M, where M is an integer greater than or equal to 0) and an attribute label A_(m) of the speech s_(m)(t). The attribute label A_(m) of the speech s_(m)(t) is an attribute value (a class) of a speaker of the speech s_(m)(t) and is a label indicating a correct attribute identification result. Moreover, the noise database 920 is a database made up of J+1 noise signals n_(j)(t) (j=0, 1, . . . , J, where J is an integer greater than or equal to 0). Each noise n_(j)(t) contained in the noise database 920 includes, for example, speech and music such as actual radio broadcast and television broadcast. The attribute identification model 930 is the attribute identification model λ_(c) used in the first embodiment.

By using the speech database 910, the noise database 920, and the attribute identification model 930, the reliability calculation model learning device 200 learns the reliability calculation model that outputs the reliability of a class c by using a posteriori probability sequence of the class c as input.

The operation of the reliability calculation model learning device 200 will be described in accordance with FIG. 9. The speech-with-noise-superimposed-thereon generating unit 210 generates speech with noise superimposed thereon x_(m)(t) from speech s_(m)(t) (m=0, 1, . . . , M) of the speech database 910 and noise n_(j)(t) (j=0, 1, . . . , J) of the noise database 920 (S210). Specifically, the speech-with-noise-superimposed-thereon generating unit 210 generates random values j, α, and a for each speech s_(m)(t) and generates speech with noise superimposed thereon x_(m)(t) by the following formula.

x_(m)(t) = s_(m)(t) + α n_(j)(t+a)  (6)

Here, j is an index for selecting noise which is superimposed on speech and 0≤j≤J holds. Moreover, α is an SN ratio and, when the power of speech and the power of noise are nearly equal, α can be an SN ratio of −20 to 30 dB, that is, α=10^(−20/10) to 10^(30/10). The value a selects the segment of noise to be used and only has to be randomly selected so as not to be longer than the time length of the noise n_(j)(t).
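Formula (6) might be sketched in Python as follows. Several details are assumptions made only for this sketch: the function name `superimpose_noise`, the uniform sampling of the decibel value, and the choice of the offset a such that the selected noise segment lies entirely within n_(j)(t) (which requires the noise to be at least as long as the speech). The conversion of the decibel value to α follows the expression in the text literally.

```python
import random

def superimpose_noise(s, noises, rng=random):
    """Generate x(t) = s(t) + alpha * n_j(t + a) with random j, alpha, and a."""
    j = rng.randrange(len(noises))   # index of the noise to use, 0 <= j <= J
    n = noises[j]
    if len(n) < len(s):
        raise ValueError("this sketch assumes the noise is at least as long as the speech")
    level_db = rng.uniform(-20.0, 30.0)   # assumed uniform over -20 to 30 dB
    alpha = 10.0 ** (level_db / 10.0)     # taken literally from the text
    # Assumed policy: choose the offset so that t + a never leaves the noise signal.
    a = rng.randrange(len(n) - len(s) + 1)
    x = [s[t] + alpha * n[t + a] for t in range(len(s))]
    return x, (j, alpha, a)
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes the generated data reproducible.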

The posteriori probability calculation unit 110 calculates, from the speech with noise superimposed thereon x_(m)(t), a posteriori probability sequence {q_(m)(c, i)} (i=0, 1, . . . , I_(m), where I_(m) is an integer greater than or equal to 0) which is a sequence of the posteriori probabilities q_(m)(c, i) that a frame i of the speech with noise superimposed thereon x_(m)(t) is a class c (S110).

The attribute identification result generating unit 230 generates an attribute identification result L_(m) of the speech s_(m)(t) from the posteriori probability sequence {q_(m)(c, i)} (i=0, 1, . . . , I_(m)) of the class c (S230).

Specifically, the attribute identification result generating unit 230 obtains a most probable estimated class ĉ_(m) by the following formula and sets the most probable estimated class ĉ_(m) as the attribute identification result L_(m).

$\begin{matrix}{\hat{c}_{m} = \underset{c}{\arg\max}\left( \sum\limits_{i = 0}^{I_{m}} \log q_{m}\left( c,i \right) \right)} & (7)\end{matrix}$
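
For illustration only, the selection of the most probable estimated class by formula (7) can be sketched in Python as follows. The function name most_probable_class, the dictionary that maps each class to its per-frame posteriori probabilities, and the small floor that avoids taking the logarithm of 0 are assumptions of this sketch.

```python
import math


def most_probable_class(posteriors, floor=1e-12):
    """Return the class that maximizes the sum over frames of log q(c, i)."""
    scores = {
        c: sum(math.log(max(q, floor)) for q in sequence)
        for c, sequence in posteriors.items()
    }
    return max(scores, key=scores.get)


# Hypothetical posteriori probability sequences for two classes and three frames.
posteriors = {"male": [0.9, 0.8, 0.3], "female": [0.1, 0.2, 0.7]}
c_hat = most_probable_class(posteriors)  # "male": its log-probability sum is larger
```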

The reliability label generating unit 240 generates a reliability label r_(m), which is used for learning of the reliability calculation model, from the attribute identification result L_(m) by using the attribute label A_(m) of the speech s_(m)(t) (S240). For instance, r_(m)=1 holds if L_(m)=A_(m) (that is, the attribute identification result is correct); r_(m)=0 holds otherwise (that is, the attribute identification result is not correct).

$\begin{matrix}{r_{m} = \left\{ \begin{matrix}1 & \left( {L_{m} = A_{m}} \right) \\0 & \left( {L_{m} \neq A_{m}} \right)\end{matrix} \right.} & (8)\end{matrix}$

The reliability label generating unit 240 records a posteriori probability sequence with a reliability label, which is a tuple of a posteriori probability sequence {q_(m)(ĉ_(m), i)} (i=0, 1, . . . , I_(m)) of the most probable estimated class ĉ_(m) and the reliability label r_(m), on the recording unit 290 and creates a database on a posteriori probability sequence with a reliability label.
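
As a minimal sketch, the reliability label of formula (8) and the recorded tuple can be expressed in Python as follows. The function names reliability_label and labelled_sequence, the in-memory list standing in for the database, and the toy data are assumptions of this sketch.

```python
def reliability_label(result, attribute_label):
    """Formula (8): 1 if the identification result equals the attribute label, else 0."""
    return 1 if result == attribute_label else 0


def labelled_sequence(posteriors, c_hat, attribute_label):
    """Tuple of the sequence of the most probable estimated class and its label."""
    return (list(posteriors[c_hat]), reliability_label(c_hat, attribute_label))


# Hypothetical data: the identification result "male" disagrees with the label "female".
posteriors = {"male": [0.6, 0.7], "female": [0.4, 0.3]}
database = [labelled_sequence(posteriors, "male", "female")]
```

Only the sequence of the most probable estimated class is stored, in line with the description above; the sequences of the other classes are not recorded in this sketch.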

By using the database on a posteriori probability sequence with a reliability label, the reliability calculation model learning unit 250 learns a reliability calculation model λ_(r) that outputs the reliability of a class c by using a posteriori probability sequence of the class c as input (S250). Since the reliability calculation model λ_(r) handles time series data, it can be configured as, for example, a neural network such as long short-term memory (LSTM) or a recurrent neural network (RNN).
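
To illustrate how a recurrent model can map a posteriori probability sequence to a single reliability value, the following Python sketch shows only the forward computation of a plain (Elman-type) RNN, which is used here instead of LSTM to keep the example short. The function name rnn_reliability, the two-unit hidden state, the sigmoid output, and all weight values are hypothetical; the weights are untrained, and the learning in S250 is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rnn_reliability(sequence, w_in, w_rec, b_h, w_out, b_out):
    """Forward pass: h_i = tanh(w_in * q_i + W_rec h_(i-1) + b_h); r = sigmoid(w_out . h_I + b_out)."""
    h = [0.0] * len(b_h)                      # initial hidden state
    for q in sequence:                        # one posteriori probability per frame
        h = [
            math.tanh(
                w_in[k] * q
                + sum(w_rec[k][l] * h[l] for l in range(len(h)))
                + b_h[k]
            )
            for k in range(len(h))
        ]
    return sigmoid(sum(w * v for w, v in zip(w_out, h)) + b_out)


# Hypothetical, untrained weights for a two-unit hidden state.
r = rnn_reliability(
    [0.2, 0.9, 0.95],
    w_in=[1.0, -0.5],
    w_rec=[[0.1, 0.0], [0.0, 0.1]],
    b_h=[0.0, 0.0],
    w_out=[1.0, 1.0],
    b_out=0.0,
)
```

Because the hidden state is carried from frame to frame, the output depends on the whole sequence and not only on the last frame, which is the property the reliability calculation model relies on.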

FIGS. 10A, 10B, and 10C show time variations in posteriori probability. FIG. 10A shows variations in posteriori probability observed when noise was not superimposed and a correct identification result was obtained, FIG. 10B shows variations in posteriori probability observed when a correct identification result was obtained for speech with noise superimposed thereon, and FIG. 10C shows variations in posteriori probability observed when a correct identification result was not obtained for speech with noise superimposed thereon. There are two differences, which will be described below, between FIG. 10B and FIG. 10C.

When a correct identification result was obtained as in FIG. 10B, a particular class tended to exhibit a high posteriori probability, whereas, when a correct identification result was not obtained as in FIG. 10C, a plurality of classes alternately exhibited a high posteriori probability with the passage of time. Moreover, when a correct identification result was obtained as in FIG. 10B, the posteriori probability remained at a value close to 1 after a lapse of some time, whereas, when a correct identification result was not obtained as in FIG. 10C, the posteriori probability did not exhibit a relatively high value even after a lapse of time and, even if the posteriori probability exhibited a high value, the duration was relatively short.

As described above, since a pattern of time variations in posteriori probability when a correct identification result was obtained is different from a pattern of time variations in posteriori probability when a correct identification result was not obtained, the reliability calculation model λ_(r) can be learned as a model that handles time series data, which makes it possible to calculate reliability.

According to the present invention, by rejecting an attribute identification result when the reliability thereof, which indicates the certainty of the attribute identification result, is low, it is possible to prevent impairment of usability and prevent displeasure caused by presenting an unreliable identification result to the user.
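
The rejection described above can be sketched in Python as follows. The function name identification_result, the use of None to stand for ϕ, and the choice of "reliability below a threshold of 0.5" as the predetermined range indicating low reliability are assumptions of this sketch; the actual range is a design parameter.

```python
REJECT = None  # stands in for the symbol indicating rejection


def identification_result(c_hat, reliability, threshold=0.5):
    """Return REJECT if the reliability is in the low range, else the estimated class."""
    if reliability < threshold:   # hypothetical "predetermined range": r < threshold
        return REJECT
    return c_hat


accepted = identification_result("male", 0.9)   # reliability high: class is output
rejected = identification_result("male", 0.2)   # reliability low: result is rejected
```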

APPENDIX

Each device according to the present invention has, as a single hardware entity, for example, an input unit to which a keyboard or the like is connectable, an output unit to which a liquid crystal display or the like is connectable, a communication unit to which a communication device (for example, a communication cable) capable of communication with the outside of the hardware entity is connectable, a central processing unit (CPU, which may include cache memory and/or registers), RAM or ROM as memories, an external storage device which is a hard disk, and a bus that connects the input unit, the output unit, the communication unit, the CPU, the RAM, the ROM, and the external storage device so that data can be exchanged between them. The hardware entity may also include, for example, a device (drive) capable of reading and writing a recording medium such as a CD-ROM as desired. A physical entity having such hardware resources may be a general-purpose computer, for example.

The external storage device of the hardware entity has stored therein programs necessary for embodying the aforementioned functions and data necessary in the processing of the programs (in addition to the external storage device, the programs may be prestored in ROM as a storage device exclusively for reading out, for example). Also, data or the like resulting from the processing of these programs are stored in the RAM and the external storage device as appropriate.

In the hardware entity, the programs and data necessary for processing of the programs stored in the external storage device (or ROM and the like) are read into memory as necessary to be interpreted and executed/processed as appropriate by the CPU. As a consequence, the CPU embodies predetermined functions (the components represented above as units, means, or the like).

The present invention is not limited to the above embodiments, but modifications may be made within the scope of the present invention. Also, the processes described in the embodiments may be executed not only in a chronological sequence in accordance with the order of their description but may also be executed in parallel or separately according to the processing capability of the device executing the processing or any necessity.

As already mentioned, when the processing functions of the hardware entities described in the embodiments (the devices of the present invention) are to be embodied with a computer, the processing details of the functions to be provided by the hardware entities are described by a program. By the program then being executed on the computer, the processing functions of the hardware entity are embodied on the computer.

The program describing the processing details can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, such as a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory. More specifically, a magnetic recording device may be a hard disk device, flexible disk, or magnetic tape; an optical disk may be a DVD (digital versatile disc), a DVD-RAM (random access memory), a CD-ROM (compact disc read only memory), or a CD-R (recordable)/RW (rewritable); a magneto-optical recording medium may be an MO (magneto-optical disc); and a semiconductor memory may be an EEP-ROM (electrically erasable and programmable read-only memory), for example.

Also, the distribution of this program is performed by, for example, selling, transferring, or lending a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. Furthermore, a configuration may be adopted in which this program is distributed by storing the program in a storage device of a server computer and transferring the program to other computers from the server computer via a network.

The computer that executes such a program first, for example, temporarily stores the program recorded on the portable recording medium or the program transferred from the server computer in a storage device thereof. At the time of execution of processing, the computer then reads the program stored in the storage device thereof and executes the processing in accordance with the read program. Also, as another form of execution of this program, the computer may read the program directly from the portable recording medium and execute the processing in accordance with the program and, furthermore, every time the program is transferred to the computer from the server computer, the computer may sequentially execute the processing in accordance with the received program. Also, a configuration may be adopted in which the transfer of a program to the computer from the server computer is not performed and the above-described processing is executed by so-called application service provider (ASP)-type service by which the processing functions are implemented only by an instruction for execution thereof and result acquisition. Note that a program in this form shall encompass information that is used in processing by an electronic computer and acts like a program (such as data that is not a direct command to a computer but has properties prescribing computer processing).

Further, although the hardware entity was described as being configured via execution of a predetermined program on a computer in this form, at least some of these processing details may instead be embodied with hardware.

The foregoing description of the embodiments of the invention has been presented for the purpose of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications or variations are possible in light of the above teaching. The embodiment was chosen and described to provide the best illustration of the principles of the invention and its practical application, and to enable one of ordinary skill in the art to utilize the invention in various embodiments and with various modifications as are suited to the particular use contemplated. All such modifications and variations are within the scope of the invention as determined by the appended claims when interpreted in accordance with the breadth to which they are fairly, legally, and equitably entitled.

1. An attribute identification device, wherein if I is assumed to be an integer greater than or equal to 0 and a set of classes for identifying a speaker of uttered speech is assumed to be an attribute, the attribute identification device comprises: a posteriori probability calculation unit that calculates, from input speech s(t), a posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) which is a sequence of posteriori probabilities q(c, i) that a frame i of the input speech s(t) is a class c; a reliability calculation unit that calculates, from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I), reliability r(c) indicating an extent to which the class c is a correct attribute identification result; and an attribute identification result generating unit that generates an attribute identification result L of the input speech s(t) from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) and the reliability r(c), and the attribute identification result generating unit obtains a most probable estimated class ĉ, which is a class that is estimated to be a most probable attribute, from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) and sets ϕ indicating rejection as the attribute identification result L if reliability r(ĉ) of the most probable estimated class ĉ falls within a predetermined range indicating that the reliability r(ĉ) is low and sets the most probable estimated class ĉ as the attribute identification result L otherwise.
2. An attribute identification device, wherein if I is assumed to be an integer greater than or equal to 0 and a set of classes for identifying a speaker of uttered speech is assumed to be an attribute, the attribute identification device comprises: a posteriori probability calculation unit that calculates, from input speech s(t), a posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) which is a sequence of posteriori probabilities q(c, i) that a frame i of the input speech s(t) is a class c; a reliability calculation unit that calculates, from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I), reliability r(c) indicating an extent to which the class c is a correct attribute identification result; and an attribute identification result generating unit that generates an attribute identification result L of the input speech s(t) from the reliability r(c), and the attribute identification result generating unit obtains a most probable estimated class ĉ, which is a class that is estimated to be a most probable attribute, from the reliability r(c) and sets ϕ indicating rejection as the attribute identification result L if reliability r(ĉ) of the most probable estimated class ĉ falls within a predetermined range indicating that the reliability r(ĉ) is low and sets the most probable estimated class ĉ as the attribute identification result L otherwise.
3. An attribute identification device, wherein if I is assumed to be an integer greater than or equal to 0 and a set of classes for identifying a speaker of uttered speech is assumed to be an attribute, the attribute identification device comprises: a posteriori probability calculation unit that calculates, from input speech s(t), a posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) which is a sequence of posteriori probabilities q(c, i) that a frame i of the input speech s(t) is a class c; and an attribute identification result generating unit that generates an attribute identification result L of the input speech s(t) from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I), and the attribute identification result generating unit includes a reliability calculation unit that calculates, from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I), reliability r(c) indicating an extent to which the class c is a correct attribute identification result, and obtains a most probable estimated class ĉ, which is a class that is estimated to be a most probable attribute, from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I), calculates reliability r(ĉ) of the most probable estimated class ĉ by using the reliability calculation unit, and sets ϕ indicating rejection as the attribute identification result L if the reliability r(ĉ) falls within a predetermined range indicating that the reliability r(ĉ) is low and sets the most probable estimated class ĉ as the attribute identification result L otherwise.
4. The attribute identification device according to any one of claims 1 to 3, wherein the reliability calculation unit calculates the reliability r(c) by using a reliability calculation model that outputs reliability of the class c by using a posteriori probability sequence of the class c as input.
5. An attribute identification method, wherein if I is assumed to be an integer greater than or equal to 0 and a set of classes for identifying a speaker of uttered speech is assumed to be an attribute, the attribute identification method comprises: a posteriori probability calculation step in which an attribute identification device calculates, from input speech s(t), a posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) which is a sequence of posteriori probabilities q(c, i) that a frame i of the input speech s(t) is a class c; a reliability calculation step in which the attribute identification device calculates, from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I), reliability r(c) indicating an extent to which the class c is a correct attribute identification result; and an attribute identification result generation step in which the attribute identification device generates an attribute identification result L of the input speech s(t) from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) and the reliability r(c), and the attribute identification result generation step obtains a most probable estimated class ĉ, which is a class that is estimated to be a most probable attribute, from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) and sets ϕ indicating rejection as the attribute identification result L if reliability r(ĉ) of the most probable estimated class ĉ falls within a predetermined range indicating that the reliability r(ĉ) is low and sets the most probable estimated class ĉ as the attribute identification result L otherwise.
6. An attribute identification method, wherein if I is assumed to be an integer greater than or equal to 0 and a set of classes for identifying a speaker of uttered speech is assumed to be an attribute, the attribute identification method comprises: a posteriori probability calculation step in which an attribute identification device calculates, from input speech s(t), a posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) which is a sequence of posteriori probabilities q(c, i) that a frame i of the input speech s(t) is a class c; a reliability calculation step in which the attribute identification device calculates, from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I), reliability r(c) indicating an extent to which the class c is a correct attribute identification result; and an attribute identification result generation step in which the attribute identification device generates an attribute identification result L of the input speech s(t) from the reliability r(c), and the attribute identification result generation step obtains a most probable estimated class ĉ, which is a class that is estimated to be a most probable attribute, from the reliability r(c) and sets ϕ indicating rejection as the attribute identification result L if reliability r(ĉ) of the most probable estimated class ĉ falls within a predetermined range indicating that the reliability r(ĉ) is low and sets the most probable estimated class ĉ as the attribute identification result L otherwise.
7. An attribute identification method, wherein if I is assumed to be an integer greater than or equal to 0 and a set of classes for identifying a speaker of uttered speech is assumed to be an attribute, the attribute identification method comprises: a posteriori probability calculation step in which an attribute identification device calculates, from input speech s(t), a posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I) which is a sequence of posteriori probabilities q(c, i) that a frame i of the input speech s(t) is a class c; and an attribute identification result generation step in which the attribute identification device generates an attribute identification result L of the input speech s(t) from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I), and the attribute identification result generation step includes a reliability calculation step of calculating, from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I), reliability r(c) indicating an extent to which the class c is a correct attribute identification result, and obtains a most probable estimated class ĉ, which is a class that is estimated to be a most probable attribute, from the posteriori probability sequence {q(c, i)} (i=0, 1, . . . , I), calculates reliability r(ĉ) of the most probable estimated class ĉ by using the reliability calculation step, and sets ϕ indicating rejection as the attribute identification result L if the reliability r(ĉ) falls within a predetermined range indicating that the reliability r(ĉ) is low and sets the most probable estimated class ĉ as the attribute identification result L otherwise.
8. A non-transitory computer-readable storage medium which stores a program for making a computer function as the attribute identification device according to any one of claims 1 to 3.
9. A non-transitory computer-readable storage medium which stores a program for making a computer function as the attribute identification device according to claim 4.